Questions 1 and 2

The answers to these two questions are provided in the two pictures below.

Question 1 & Part A of Question 2

Question 2-Part B

Question 3

Using the majority vote approach and the average probability approach, what is the final classification under each?

red <- c(.1,.15,.2,.2,.55,.6,.6,.65,.7,.75)
mean(red)
## [1] 0.45
sum(red > .5)
## [1] 6

Under the majority vote approach, 6 of the 10 estimates exceed 0.5, so the final classification is Red. Under the average probability approach, the mean estimated probability is 0.45 < 0.5, so the final classification is not Red (Green).

Question 4

  1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.

  2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of \(\alpha\).

  3. Use K-fold cross-validation to choose \(\alpha\). That is, divide the training observations into K folds. For each \(k = 1, \ldots, K\):

    1. Repeat steps 1 and 2 on all but the \(k^{th}\) fold of the training data.

    2. Evaluate the mean squared prediction error on the data in the left-out \(k^{th}\) fold, as a function of \(\alpha\).

    Average the results for each value of \(\alpha\), and pick \(\alpha\) to minimize the average error.

  4. Return the subtree from step 2 that corresponds to the chosen value of \(\alpha\).
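In R, the tree package automates these steps; a minimal sketch, using the Boston housing data from MASS as a stand-in (the object names here are illustrative):

```r
library(tree)

# Step 1: grow a large tree via recursive binary splitting
# (tree() stops at its own default minimum node size)
fit <- tree(medv ~ ., data = MASS::Boston)

# Steps 2-3: cost-complexity pruning with K-fold CV (K = 10 by default);
# cv.tree() reports the CV deviance for the sequence of best subtrees
cv_out <- cv.tree(fit)

# Step 4: keep the subtree whose size minimizes the CV deviance
best_size <- cv_out$size[which.min(cv_out$dev)]
pruned <- prune.tree(fit, best = best_size)
```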

Question 5

Part A

set.seed(10)
library(ISLR2);library(tree)
## Warning: package 'tree' was built under R version 4.1.3
# Splitting data into training and test set
train <- sample(1:nrow(Carseats), nrow(Carseats)/2)
test <- Carseats[-train,]

Part B

# Fitting a regression tree on the training set using sales as response variable
set.seed(10)
tree_carseat <- tree(Sales~.,data = Carseats, subset = train)
plot(tree_carseat)
text(tree_carseat, pretty = 1)

#output of carseat regression tree
summary(tree_carseat)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = Carseats, subset = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"  "Price"      "Age"        "CompPrice"  "Population"
## Number of terminal nodes:  14 
## Residual mean deviance:  2.378 = 442.2 / 186 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -4.33500 -1.02300  0.06757  0.00000  0.96470  3.93500
# Obtain Test MSE
set.seed(10)
seat_pred <- predict(tree_carseat,newdata = test)
mean((seat_pred-test$Sales)^2)
## [1] 5.202316

Part B Answers:

The regression tree actually used 5 variables in its construction and has 14 terminal nodes. The best predictor according to the tree is shelf location (ShelveLoc), which forms the top split. Using my validation set I obtained a test MSE of 5.202, which is higher than the training error (residual mean deviance of 2.378), as expected.

Part C

# Using cross validation to determine optimal level of tree complexity. 
set.seed(10)
cv_seat <- cv.tree(tree_carseat)
plot(cv_seat$size, cv_seat$dev, type = "b")

# Pruning Tree with 5 terminal nodes
prune_seat <- prune.tree(tree_carseat, best = 5)
plot(prune_seat)
text(prune_seat, pretty = 0)

#Calculating Test MSE with pruned tree

prune_yhat <- predict(prune_seat, newdata = test)
mean((prune_yhat-test$Sales)^2)
## [1] 5.00269

Part C Answer: The optimal level of complexity had only 5 terminal nodes, based on the smallest cross-validation error. Yes, pruning improved the test MSE, since \(5.003 < 5.202\).

Part D

#Performing bagging approach 
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
bag_seat <- randomForest(Sales~.,data = Carseats, subset = train, mtry = 10, importance = TRUE)
bag_yhat <- predict(bag_seat, newdata = test)
mean((bag_yhat - test$Sales)^2)
## [1] 3.002559
bag_seat
## 
## Call:
##  randomForest(formula = Sales ~ ., data = Carseats, mtry = 10,      importance = TRUE, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##           Mean of squared residuals: 2.745786
##                     % Var explained: 68.41
#Using importance() function to determine most important variables
importance(bag_seat)
##                %IncMSE IncNodePurity
## CompPrice   16.7056094    137.663159
## Income       6.1726270     82.647872
## Advertising  6.7667529     61.038198
## Population   1.3436274     61.170907
## Price       52.3596661    418.926579
## ShelveLoc   75.4088192    719.467211
## Age         22.6609712    164.147571
## Education    0.5010586     37.596310
## Urban       -2.4362480      6.362525
## US           2.5465344      7.943683
varImpPlot(bag_seat)

Part D Answer:

The test MSE I obtained from bagging was \(3.003\), which is lower than both the unpruned regression tree and the pruned tree. This makes sense because bagging reduces variance: 500 trees are grown on 500 bootstrapped training sets, and the resulting predictions are averaged. Averaging many low-bias but high-variance trees yields a lower-variance model, thus reducing the test MSE.

Using the importance() function, I found that shelf location (ShelveLoc) and Price are the most important variables.
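The variance-reduction argument can be made concrete: for B tree predictions with per-tree variance \(\sigma^2\) and pairwise correlation \(\rho\), the variance of their average is \(\rho\sigma^2 + (1-\rho)\sigma^2/B\). A small simulation illustrates this (the values of rho and sigma below are illustrative, not estimated from Carseats):

```r
set.seed(1)
B <- 500; rho <- 0.5; sigma <- 1; n_reps <- 10000

# Each "tree" prediction = shared component (variance rho*sigma^2)
#                        + idiosyncratic noise (variance (1-rho)*sigma^2),
# giving pairwise correlation rho between any two predictions
shared <- rnorm(n_reps, sd = sqrt(rho) * sigma)
eps <- matrix(rnorm(n_reps * B, sd = sqrt(1 - rho) * sigma), nrow = n_reps)

# Variance of the bagged average: only the idiosyncratic part shrinks with B
avg <- shared + rowMeans(eps)
var(avg)  # close to rho*sigma^2 + (1 - rho)*sigma^2/B = 0.501
```

Averaging drives the \((1-\rho)\sigma^2/B\) term toward zero, but the correlated part \(\rho\sigma^2\) remains, which is why random forests additionally try to decorrelate the trees.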

#Using random forest to analyze data; I will choose m to equal 4

bag_random <- randomForest(Sales~., data = Carseats, subset = train, mtry = 4, importance = TRUE)
bag_random
## 
## Call:
##  randomForest(formula = Sales ~ ., data = Carseats, mtry = 4,      importance = TRUE, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 2.831533
##                     % Var explained: 67.42
bag_ran_yhat <- predict(bag_random, newdata = test)
mean((bag_ran_yhat-test$Sales)^2)
## [1] 3.001544
# Using importance() function

importance(bag_random)
##                %IncMSE IncNodePurity
## CompPrice   14.0562500     147.04907
## Income       2.9759045     102.36762
## Advertising  6.1148252      82.51048
## Population  -0.2430633      88.88349
## Price       37.5451519     367.81849
## ShelveLoc   52.9994335     581.38917
## Age         18.9617144     201.52419
## Education    1.1375056      59.41032
## Urban       -0.6638014      10.99337
## US           3.6517779      14.98082
varImpPlot(bag_random)

Part E Answers:

Using m = 4, meaning only four of the 10 variables are considered at each split, resulted in a test MSE of 3.002, essentially the same as the bagging approach (3.003). Once again the most important variables are ShelveLoc and Price.

#Understanding the effect of M

bag_random_6 <- randomForest(Sales~.,data = Carseats, subset = train, mtry = 6, importance = TRUE)
bag_random_yhat <- predict(bag_random_6, newdata = test)
mean((bag_random_yhat - test$Sales)^2)
## [1] 2.902661

Part E Answers (continued):

With four variables considered at each split I obtained a test MSE of 3.002; with six variables the test MSE was 2.903; and the full bagging approach (m = 10) gave a test MSE of 3.003.

The random forest with m = 6 achieved a slightly lower test MSE than the bagged approach, while m = 4 was essentially the same. Restricting the variables considered at each split decorrelates the trees; because strong predictors such as ShelveLoc and Price dominate the splits, forcing some trees to look past them can reduce the variance of the averaged prediction, though here the improvement is modest.
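To see the effect of m more systematically, one could sweep mtry over its full range and plot the resulting test MSE; a sketch (this refits a forest for each value, so it takes a few seconds):

```r
library(ISLR2); library(randomForest)
set.seed(10)
train <- sample(1:nrow(Carseats), nrow(Carseats)/2)
test  <- Carseats[-train, ]

# Test MSE as a function of the number of variables tried at each split
mse <- sapply(1:10, function(m) {
  fit <- randomForest(Sales ~ ., data = Carseats, subset = train, mtry = m)
  mean((predict(fit, newdata = test) - test$Sales)^2)
})
plot(1:10, mse, type = "b", xlab = "mtry", ylab = "Test MSE")
```

Because each forest is itself random, the exact curve will vary from run to run, but it shows at a glance where the test MSE bottoms out between small m and full bagging (m = 10).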